⭐ 📅 📃 👉 📖 🤦‍♂️ 🖖 🤓

1 Captain’s log

Star date 71750.51. Our mission is to use R statistical software to extract star dates mentioned in the captain’s log from the scripts of Star Trek: The Next Generation and observe their progression over the course of the show’s seven seasons. There appears to be some mismatch in the frequency of digits after the decimal point – could this indicate poor abillity to choose random numbers? Or something more sinister? We shall venture deep into uncharted territory for answers…

We’re going to:

  • iterate reading in text files – containing Star Trek: The Next Generation scripts – to R and then extract stardates using the purrr and stringr packages
  • web scrape episode names using the rvest package and join them to the stardates data
  • tabulate and plot these interactively with ggplot2, plotly and DT

Disclimaer: there’s probably nothing new here for real Star Trek fans, but you might learn something new if you’re an R fan.

2 Make it so!

Start by downloading all the Star Trek: The Next Generation scripts from the Star Trek Minutiae website. These are provided in a zipped folder with 176 text files – one for each episode.

3 Energise!

Ready the workspace by loading the packages we’ll need for data manipulation.

library(readr)  # read files
library(purrr)  # iterate functions over files
library(stringr)  # manipulate strings
library(dplyr)  # data manipulation and pipe opeartor (%>%)
library(janitor)  # misc data manipulation and aggregation

4 Lieutenant Commander Data

We’re going to extract the content of the the text files using the read_lines() function from the readr package. We’ll iterate over each file with the map() function from the purrr package to read them into a list object where each element is a script.

scripts <- purrr::map(
  list.files(  # create vector of filepath strings to each file
    "data/scripts",  # file location of the text files 
    full.names = TRUE  # e.g. "data/scripts/102.txt"
    ),
  readr::read_lines # read the content from each filepath
  )

We can take a look at some example lines ([17:34]) from the title page of the first script (element [[1]]).

scripts[[1]][17:34]
##  [1] "                STAR TREK: THE NEXT GENERATION "        
##  [2] "                              "                         
##  [3] "                    \"Encounter at Farpoint\" "         
##  [4] "                              "                         
##  [5] "                              by "                      
##  [6] "                         D.C. Fontana "                 
##  [7] "                              and "                     
##  [8] "                       Gene Roddenberry "               
##  [9] ""                                                       
## [10] ""                                                       
## [11] "This script is not for publicaion or reproduction."     
## [12] "No one is authorized to dispose of the same. If lost or"
## [13] "destroyed, please notify the Script Department."        
## [14] ""                                                       
## [15] ""                                                       
## [16] "                         FINAL DRAFT"                   
## [17] ""                                                       
## [18] "                        April 13, 1987"

Our first example of a star date is in the Captain’s log voiceover in lines 46 to 50 of the first script.

scripts[[1]][46:50]
## [1] "\t\t\t\t\tPICARD V.O."                 
## [2] "\t\t\tCaptain's log, stardate 42353.7."
## [3] "\t\t\tOur destination is planet Cygnus"
## [4] "\t\t\tIV, beyond which lies the great" 
## [5] "\t\t\tunexplored mass of the galaxy."

5 Engage!

We want to extract stardate strings from each script as delivered in the captain’s voiceover. As you can see above, the stardates are given in the form ‘captiain’s log, stardate XXXXX.X’, where each X is a digit. We can start our search pattern with ‘date’ to help us avoid matching to strings that have a stardate-like pattern but aren’t stardates.

We can extract these with str_extract_all() from the stringr package, using a regex (regular expression). Our regex is written date[:space:][[:digit:]\\.[:digit:]]{7}. This means ‘find a string that starts with the word date followed by a space (date), which is followed by a string that contains digits ([:digit:]) with a period (\\.) inside, with a total length of seven characters ({7})’.

This will provide a list object where each element contains the regex-matched string for a script.

stardate_extract <- stringr::str_extract_all(  # extract all instances
  scripts,  # location from which to extract
  pattern = "date[:space:][[:digit:]\\.[:digit:]]{7}"  # regex
)

head(stardate_extract)  # see the first few list elements
## [[1]]
## [1] "date 42353.7" "date 42354.1" "date 42354.2" "date 42354.7"
## [5] "date 42372.5"
## 
## [[2]]
## [1] "date 41209.2" "date 41209.3"
## 
## [[3]]
## [1] "date 41235.2" "date 41235.3"
## 
## [[4]]
## [1] "date 41294.5" "date 41294.7"
## 
## [[5]]
## [1] "date 41263.1" "date 41263.2" "date 41263.3" "date 41263.4"
## 
## [[6]]
## [1] "date 41194.6" "date 41194.8"

We’re now going to tidy the data to:

  • turn the list into a dataframe (tibble::enframe()) with one row per episode
  • turn this into a dataframe with one row per stardate (tidyr::unnest())
  • rename the columns ‘episode’ and ‘stardate’ (dplyr::transmute()) and remove the instances of the string ‘date’ (stringr::str_replace())
  • create a season column that manually applies the season number to each row depending on its episode number (dplyr::mutate(dplyr::case_when()))
  • remove strings not in the form XXXXX.X (dplyr::mutate(dplyr::if_else()))
  • extract the digit after the decimal place in the stardate
  • remove any NAs (dplyr::filter())
stardate_tidy <- stardate_extract %>% 
  tibble::enframe() %>% 
  tidyr::unnest() %>% 
  dplyr::transmute(
    episode = name,
    stardate = stringr::str_replace(
      string = value,
      pattern = "date ",
      replacement = ""
    )
  ) %>% 
  dplyr::mutate(
    season = as.character(
      dplyr::case_when(
        episode %in% 1:25 ~ "1",
        episode %in% 26:47 ~ "2",
        episode %in% 48:73 ~ "3",
        episode %in% 74:99 ~ "4",
        episode %in% 100:125 ~ "5",
        episode %in% 126:151 ~ "6",
        episode %in% 152:176 ~ "7"
      )
    ),
    stardate = as.numeric(
      dplyr::if_else(
        condition = stardate %in% c("41148..", "40052..", "37650.."),
        true = "NA",
        false = stardate
      )
    ),
    stardate_decimal = as.numeric(
      str_sub(
        as.character(stardate),
        7,
        7
      )
    ),
    stardate_decimal = ifelse(
      is.na(stardate_decimal),
      0,
      stardate_decimal
    )
  ) %>% 
  dplyr::filter(!is.na(stardate))

dplyr::glimpse(stardate_tidy)
## Observations: 263
## Variables: 4
## $ episode          <int> 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5, ...
## $ stardate         <dbl> 42353.7, 42354.1, 42354.2, 42354.7, 42372.5, ...
## $ season           <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", ...
## $ stardate_decimal <dbl> 7, 1, 2, 7, 5, 2, 3, 2, 3, 5, 7, 1, 2, 3, 4, ...

6 Prepare a scanner probe!

So we have the season and episode numbers, but we don’t have the episode names. We could extract these from the scripts using regex as well, but another option is simply to scrape them from the Wikipedia page that gives episode information for The Next Generation. If ou visit that link, you’ll notice that the table sof episodes actually give a stardate, but they only provide one per episode – our script-scraping shows that many episodes have mulitple instances of stardates and voiceovers by the captain.

We can use the rvest package to perform the scrape. This works by supplying a website address and the path of the thing we want to extract – the episode name column of tables on the Wikipedia page. I used SelectorGadget – a point-and-click tool for finding the CSS selectors for elements of webpages – for this column in each of the tables on the Wikipedia page (.wikiepisodetable tr > :nth-child(3)).

library(rvest)

# Store website address
tng_ep_wiki <- rvest::html(
  "https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Next_Generation_episodes"
  )

# extract and tidy
tng_ep_names <- tng_ep_wiki %>%  # website address
  rvest::html_nodes(
    ".wikiepisodetable tr > :nth-child(3)"  # episode name column
  ) %>%
  rvest::html_text() %>%  # extract text
  dplyr::tibble() %>%  # to dataframe
  rename(episode_title = ".") %>%  # sensible column name
  dplyr::filter(episode_title != "Title") %>%  # remove table headers
  dplyr::mutate(episode = row_number())  # episode number (join key)

tng_ep_names
## # A tibble: 176 x 2
##    episode_title                      episode
##    <chr>                                <int>
##  1 "\"Encounter at Farpoint\""              1
##  2 "\"The Naked Now\""                      2
##  3 "\"Code of Honor\""                      3
##  4 "\"The Last Outpost\""                   4
##  5 "\"Where No One Has Gone Before\""       5
##  6 "\"Lonely Among Us\""                    6
##  7 "\"Justice\""                            7
##  8 "\"The Battle\""                         8
##  9 "\"Hide and Q\""                         9
## 10 "\"Haven\""                             10
## # ... with 166 more rows

So now we can join the episode names to the dataframe generated from the scripts. This gives us a table with a row per stardate extracted, with its associated season, episode number and episode name.

stardate_tidy_names <- left_join(
  x = stardate_tidy,
  y = tng_ep_names,
  by = "episode"
) %>% 
  select(season, episode, episode_title, stardate, stardate_decimal)

stardate_tidy_names
## # A tibble: 263 x 5
##    season episode episode_title               stardate stardate_decimal
##    <chr>    <int> <chr>                          <dbl>            <dbl>
##  1 1            1 "\"Encounter at Farpoint\""   42354.               7.
##  2 1            1 "\"Encounter at Farpoint\""   42354.               1.
##  3 1            1 "\"Encounter at Farpoint\""   42354.               2.
##  4 1            1 "\"Encounter at Farpoint\""   42355.               7.
##  5 1            1 "\"Encounter at Farpoint\""   42372.               5.
##  6 1            2 "\"The Naked Now\""           41209.               2.
##  7 1            2 "\"The Naked Now\""           41209.               3.
##  8 1            3 "\"Code of Honor\""           41235.               2.
##  9 1            3 "\"Code of Honor\""           41235.               3.
## 10 1            4 "\"The Last Outpost\""        41294.               5.
## # ... with 253 more rows

We can make these data into an interactive table with the DT::datatable htmlwidget. The output table can be searched (search box in upper right) and filtered (filters under each column) and the data copied or downloaded (CSV, Excel or PDF) using the buttons in the upper left of the table. You can choose to show 10, 25, 50 or all the rows with the ‘show x entries’ dropdown.

library(DT)

stardate_tidy_names %>% 
  mutate(
    season = as.factor(season),
    episode_title = as.factor(episode_title)
    ) %>% 
  DT::datatable(
    filter = "top",
    extensions = 'Buttons',
      options = list(
        autoWidth = TRUE,  # column width consistent when making selections
        dom = "Blfrtip",
        buttons = 
          list("copy", list(
            extend = "collection",
            buttons = c("csv", "excel", "pdf"),
            text = "Download"
          ) 
          ),
        # customize the length menu
        lengthMenu = list(
          c(10, 25, 50, -1), # declare values
          c(10, 25, 50, "All") # declare titles
        ), # end of lengthMenu customization
        pageLength = 10
      )
    )

7 On screen!

Let’s visualise the stardates by episode.

We can make this interactive using the plotly package – another htmlwidget for R – that conveneniently has teh funciton plotly::ggplotly that can turn a ggplot object into an interactive plot. You can hover over each point to find out more information about it.

Of course, there’s a package called ggsci that contains a discrete colour scale based on the colours of the shirts of the crew in teh first Star Trek series. Obviously we’ll use that here.

library(ggplot2)  # basic plotting
library(plotly)  # make plot interactive
library(ggsci)  # star trek colour scale
library(ggthemes)  # dark plot theme

# create basic plot
stardate_dotplot <- stardate_tidy_names %>% 
  ggplot2::ggplot() +
  geom_point(
    aes(
      x = episode,
      y = stardate,
      color = season,
      label = episode_title
    )
  ) +
  labs(title = "Stardates are almost (but not quite) chronological") +
  theme_solarized_2(light = FALSE) + 
  ggsci::scale_color_startrek()

# make plot interactive
stardate_dotplot %>% 
  plotly::ggplotly() %>% 
  layout(margin = list(l = 75))

So there was some nonlinearity between episodes of the first and second series and at the beginning of the third, but episodes become more chronological from that point onward.

Three points seem to be anomalous with stardates well before the present time period of the episode. Without spoiling them (too much), we can see that each of these episodes takes place in, or references, the past.

Identity Crisis (season 4, episode 91, stardate 40164.7) takes place partly in the past:

scripts[[91]][127:129]
## [1] "\tGEORDI moves into view, holding a Tricorder. (Note:"  
## [2] "\tGeordi is younger here, wearing a slightly different,"
## [3] "\tearlier version of his VISOR.)"

Dark Page (season 7, episode 158, stardate 30620.1) has a scene involving a diary:

scripts[[158]][2219:2235]
##  [1] "\t\t\tPerhaps we should go further back."               
##  [2] ""                                                       
##  [3] "\t\t\t\t\tTROI"                                         
##  [4] "\t\t\tThere's a lot to review. My"                      
##  [5] "\t\t\tmother's kept a journal since she"                
##  [6] "\t\t\twas first married..."                             
##  [7] ""                                                       
##  [8] "\t\t\t\t\tPICARD"                                       
##  [9] "\t\t\tWell, let's start at the"                         
## [10] "\t\t\tbeginning."                                       
## [11] ""                                                       
## [12] "\tHe sits at the computer, which shows a series of file"
## [13] "\tnumbers and dates. He works the controls."            
## [14] ""                                                       
## [15] "\t\t\t\t\tPICARD"                                       
## [16] "\t\t\tThe first entry seems to be"                      
## [17] "\t\t\tStardate 30620.1."

All Good Things (season 7, epiosde 176, stardate 41153.7) involves some time travel for Captain Picard:

scripts[[176]][1561:1569]
## [1] "\t\t\t\t\tPICARD (V.O.)"                 
## [2] "\t\t\tPersonal Log: Stardate 41153.7."   
## [3] "\t\t\tRecorded under security lockout"   
## [4] "\t\t\tOmega three-two-seven. I have"     
## [5] "\t\t\tdecided not to inform this crew of"
## [6] "\t\t\tmy experiences. If it's true that" 
## [7] "\t\t\tI've travelled to the past, I"     
## [8] "\t\t\tcannot risk giving them"           
## [9] "\t\t\tforeknowledge of what's to come."

8 Enhance!

Make a barplot of the frequency of the figure after the decimal place in the stardates.

stardate_tidy_names %>% 
  ggplot2::ggplot() +
  geom_bar(aes(as.character(stardate_decimal)), fill = "#CC0C00FF") +
  labs(
   title = "Decimals one to three are most frequent and zero the least frequent",
    x = "stardate decimal value"
  ) +
  theme_solarized_2(light = FALSE)

And in tabular form:

stardate_tidy_names %>% 
  janitor::tabyl(stardate_decimal) %>% 
  knitr::kable()
stardate_decimal n percent
0 3 0.0114068
1 41 0.1558935
2 58 0.2205323
3 37 0.1406844
4 20 0.0760456
5 24 0.0912548
6 18 0.0684411
7 29 0.1102662
8 17 0.0646388
9 16 0.0608365

9 Belay that!

How does this look across the seasons?

stardate_tidy_names %>% 
  ggplot2::ggplot() +
  geom_bar(
    aes(as.character(stardate_decimal)),
    fill= c(
      rep("#CC0C00FF", 10),
      rep("#5C88DAFF", 9),
      rep("#84BD00FF", 10),
      rep("#FFCD00FF", 9),
      rep("#7C878EFF", 10),
      rep("#00B5E2FF", 8),
      rep("#00AF66FF", 8)
    )
  ) +
  labs(
    title = "There's a similar(ish) pattern of decimal stardate frequency across seasons",
    x = "stardate decimal value"
  ) +
  facet_wrap(~ season) +
  theme_solarized_2(light = FALSE) + 
  scale_color_startrek()

And in tabular form:

stardate_tidy_names %>% 
  group_by(season) %>% 
  count(stardate_decimal) %>% 
  rename(count = n) %>% 
  DT::datatable(
    filter = "top",
    rownames = FALSE
  )

10 Speculate!

So stardates are more or less chronological across the duration of The Next Generation’s seven series, implying that the writers had a system in place. A few wobbles in consistency appear during the first few season suggest that it took some time to get this right. None of this is new information (see the links in the Open Channel! section below).

It seems the vast majority of episodes take place in the programme’s present with a few exceptions. We may have missed some forays through time simply because the stardate was unknown or unmentioned.

There appears to be some non-random pattern in the frequency of the digits 0 to 9 after the decimal place. Its not entirely clear if there is a reason for this within the universe of The Next Generation, but perhaps the writers put little thought to it and humans are bad at selecting random numbers anyway (relevant xkcd).

It turns out that this kind of investigation has been down before, buried in Section II.5 of [STArchive]’s stardate FAQ. I don’t know what method was used, but the exact results differ to the ones presented here. The basic pattern is similar though: few zeroes with 1, 2 and 3 being most common.

11 Open channel!

A selection of further reading:

12 Full stop!

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggthemes_3.4.0     ggsci_2.8          plotly_4.7.1      
##  [4] ggplot2_2.2.1.9000 DT_0.4.5           rvest_0.3.2       
##  [7] xml2_1.1.1         bindrcpp_0.2       janitor_0.3.0     
## [10] dplyr_0.7.4        stringr_1.2.0      purrr_0.2.4       
## [13] readr_1.1.1        emo_0.0.0.9000    
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.3-2    htmltools_0.3.6     viridisLite_0.3.0  
##  [4] yaml_2.1.18         utf8_1.1.3          XML_3.98-1.9       
##  [7] rlang_0.2.0         pillar_1.2.1        glue_1.2.0         
## [10] withr_2.1.1.9000    selectr_0.3-1       bindr_0.1          
## [13] plyr_1.8.4          munsell_0.4.3       gtable_0.2.0       
## [16] htmlwidgets_1.0     evaluate_0.10.1     labeling_0.3       
## [19] knitr_1.18          httpuv_1.3.5        crosstalk_1.0.1    
## [22] curl_3.0            highr_0.6           Rcpp_0.12.15       
## [25] xtable_1.8-2        backports_1.1.1     scales_0.5.0.9000  
## [28] jsonlite_1.5        mime_0.5            hms_0.3            
## [31] digest_0.6.15       stringi_1.1.6       shiny_1.0.5        
## [34] rprojroot_1.2       grid_3.4.3          cli_1.0.0          
## [37] tools_3.4.3         magrittr_1.5        lazyeval_0.2.1     
## [40] tibble_1.4.2        crayon_1.3.4        tidyr_0.7.2        
## [43] pkgconfig_2.0.1     data.table_1.10.4-2 lubridate_1.7.2    
## [46] assertthat_0.2.0    rmarkdown_1.6       httr_1.3.1         
## [49] R6_2.2.2            compiler_3.4.3

  1. The star date for today’s date (14 April 2018) as calculated using the trekguide.com method; this “would be the stardate of this week’s episode if The Next Generation and its spinoffs were still in production”.